Mining for Domain-specific Parallel Text from Wikipedia
Authors
Abstract
Previous attempts at extracting parallel data from Wikipedia were restricted by the monotonicity constraint of the alignment algorithm used to match candidate sentences. This paper proposes a method for exploiting Wikipedia articles that is independent of the position of the sentences in the text. The algorithm ranks the candidate sentence pairs by means of a customized metric that combines different similarity criteria. Moreover, we limit the search space to a specific topical domain, since our final goal is to use the extracted data in a domain-specific Statistical Machine Translation (SMT) setting. The precision estimates show that the extracted sentence pairs are clearly semantically equivalent. The SMT experiments, however, show that the extracted data is not refined enough to improve a strong in-domain SMT system. Nevertheless, it is good enough to boost the performance of an out-of-domain system trained on sizable amounts of data.
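To make the ranking step concrete, the sketch below scores candidate sentence pairs with a weighted combination of similarity signals and keeps the best-scoring pairs. The particular criteria (length ratio, lexical overlap against a bilingual lexicon), the weights, and the threshold are illustrative assumptions for this example only; the paper's actual customized metric is not reproduced here.

```python
# Illustrative sketch of ranking candidate sentence pairs with a combined
# similarity metric. The features, weights, and threshold are assumptions
# made for this example; they are not the paper's actual customized metric.

def length_ratio(src: str, tgt: str) -> float:
    """Similarity signal: penalize pairs with strongly diverging lengths."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt) if max(ls, lt) else 0.0

def lexical_overlap(src: str, tgt: str, lexicon: dict[str, set[str]]) -> float:
    """Similarity signal: fraction of source tokens whose translation
    (according to a bilingual lexicon) appears in the target sentence."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)

def pair_score(src: str, tgt: str, lexicon: dict[str, set[str]],
               w_len: float = 0.3, w_lex: float = 0.7) -> float:
    """Combined metric: a weighted sum of the individual similarity criteria."""
    return w_len * length_ratio(src, tgt) + w_lex * lexical_overlap(src, tgt, lexicon)

def rank_candidates(pairs, lexicon, threshold=0.5):
    """Score every candidate pair and keep the best ones, highest first."""
    scored = [(pair_score(s, t, lexicon), s, t) for s, t in pairs]
    return sorted((x for x in scored if x[0] >= threshold), reverse=True)
```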
Similar papers
Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora
In this article, we present an automated approach to extracting English-Bengali parallel fragments of text from comparable corpora created using Wikipedia documents. Our approach exploits the multilingualism of Wikipedia. Importantly, this approach does not need any domain-specific corpus. We have been able to improve the BLEU score of an existing domain-specific English-Benga...
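As a concrete illustration of exploiting the multilingualism of Wikipedia, the sketch below pairs an English article with its Bengali counterpart through the standard MediaWiki API `langlinks` query, which is the usual basis for building comparable document pairs. The helper function is hypothetical, omits pagination and error handling, and is not the authors' actual pipeline.

```python
# Hypothetical helper: find the Bengali counterpart of an English Wikipedia
# article via interlanguage links, using the standard MediaWiki API
# `langlinks` property. Pagination and error handling are omitted.
import requests

API = "https://en.wikipedia.org/w/api.php"

def bengali_counterpart(english_title: str) -> str | None:
    """Return the title of the linked Bengali article, or None if absent."""
    params = {
        "action": "query",
        "titles": english_title,
        "prop": "langlinks",
        "lllang": "bn",           # restrict to Bengali interlanguage links
        "format": "json",
        "formatversion": "2",
    }
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    links = pages[0].get("langlinks", [])
    return links[0]["title"] if links else None
```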
Aisles through the Category Forest - Utilising the Wikipedia Category System for Corpus Building in Machine Learning
The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods developed in order to tackle the problem of finding and organising relevant information. It has often been argued that semantic classification of input documents helps solve this task. But while supervised text categorisation approaches perform quite well on gen...
A Model for Extracting Information from Textual Documents, Based on Text Mining in the E-learning Domain
As computer networks become the backbones of science and economy, enormous quantities of documents become available. To extract useful information from this textual data, text mining techniques have been used. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...
A language-independent method for the extraction of RDF verbalization templates
With the rise of the Semantic Web, more and more data becomes available encoded in the Semantic Web standard RDF. RDF is geared towards machines: designed to be easily processable by machines, it is difficult for casual users to understand. Transforming RDF data into human-comprehensible text would enable non-experts to assess this information. In this paper we present a language-independen...
Parallel-Wiki: A Collection of Parallel Sentences Extracted from Wikipedia
Parallel corpora are essential resources for certain Natural Language Processing tasks such as Statistical Machine Translation. However, the existing publicly available parallel corpora are specific to limited genres or domains, mostly juridical (e.g. JRC-Acquis) and medical (e.g. EMEA), and there is a lack of such resources for the general domain. This paper addresses this issue and presents...